true

Summary

This post showcases the capabilities of the R package distantia to load, explore, and analyze epidemiological time series.

Setup

This tutorial requires the following packages:

library(distantia)
library(dtw)
library(zoo)
library(dplyr)
library(mapview)
library(reactable)
library(dygraphs)

Example Data

This demo focuses on two example data frames shipped with the package distantia: covid_counties and covid_prevalence.

covid_counties

Stored as a simple features data frame, covid_counties contains county polygons and several socioeconomic variables. It is connected to covid_prevalence by county name, which is stored in the column name.

The socioeconomic variables available in covid_counties are:

  • area_hectares: county surface.
  • population: county population.
  • poverty_percentage: population percentage below the poverty line.
  • median_income: median county income in dollars.
  • domestic_product: yearly domestic product in billions (not a typo) of dollars.
  • daily_miles_traveled: daily miles traveled by the average inhabitant.
  • employed_percentage: percentage of the county population under employment.

Please, take in mind that these variables were included in the dataset because they were easy to capture from on-line sources, not because of their importance for epidemiological analyses.

covid_prevalence

This data frame contains weekly Covid-19 prevalence in 36 California counties between 2020-03-16 and 2023-12-18. It is derived from a daily prevalence dataset available here.

The prevalence time series has the columns name, time, with the date of the first day of the week each data point represents, and prevalence, expressed as proportion of positive tests. The table below shows the first rows of this data frame.

Objective

This post focuses on the Covid-19 dataset described above to showcase the applications of distantia to the analysis of epidemiological time series.

DISCLAIMER: I am not an epidemiologist, so I will refrain from interpreting any results, and will focus on technical details about the usage of distantia insted.

Data Preparation

This section shows how to transform the data frame covid_prevalence into a format compatible with distantia.

A time series list, or tsl for short, is a list of zoo time series representing unique realizations of the same phenomena observed in different sites, times, or individuals. All data manipulation and analysis functions in distantia are designed to be applied to all time series in a tsl at once.

The function tsl_initialize() transforms time series stored as long data frames into time series lists.

tsl <- distantia::tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)

Each element in tsl is named after the county the data belongs to.

names(tsl)
##  [1] "Alameda"         "Butte"           "Contra_Costa"    "El_Dorado"      
##  [5] "Fresno"          "Humboldt"        "Imperial"        "Kern"           
##  [9] "Kings"           "Los_Angeles"     "Madera"          "Marin"          
## [13] "Merced"          "Monterey"        "Napa"            "Orange"         
## [17] "Placer"          "Riverside"       "Sacramento"      "San_Bernardino" 
## [21] "San_Diego"       "San_Francisco"   "San_Joaquin"     "San_Luis_Obispo"
## [25] "San_Mateo"       "Santa_Barbara"   "Santa_Clara"     "Santa_Cruz"     
## [29] "Shasta"          "Solano"          "Sonoma"          "Stanislaus"     
## [33] "Sutter"          "Tulare"          "Ventura"         "Yolo"

The zoo objects within tsl also have the attribute name to help track the data in case it is extracted from the time series list.

attributes(tsl[[1]])$name
## [1] "Alameda"
attributes(tsl[[2]])$name
## [1] "Butte"

Each individual zoo object comprises a time index and a data matrix.

zoo::index(tsl[["Alameda"]]) |> 
  head()
## [1] "2020-03-16" "2020-03-23" "2020-03-30" "2020-04-06" "2020-04-13"
## [6] "2020-04-20"
zoo::coredata(tsl[["Alameda"]]) |> 
  head()
##   prevalence
## 1       0.01
## 2       0.01
## 3       0.01
## 4       0.01
## 5       0.01
## 6       0.02

Exploration

This section describes the tools in distantia that may help develop an intuition on the properties of the data at hand, either via visualization or descriptive analysis.

Visualization

The function tsl_plot() provides a static multipanel visualization of a large number of time series at once.

distantia::tsl_plot(
  tsl = tsl,
  columns = 3,
  guide = FALSE,
  text_cex = 1.2
)

Combined with tsl_subset(), it can help focus on particular counties and zoom-in over specific time periods.

distantia::tsl_plot(
  tsl = tsl_subset(
    tsl = tsl,
    names = c("Los_Angeles", "Kings"),
    time = c("2021-09-01", "2022-01-31")
  ),
  guide = FALSE
)

The individual zoo objects in tsl can be plotted right away using plot() or distantia::zoo_plot().

distantia::zoo_plot(
  x = tsl$Los_Angeles
  )

Another good option to plot zoo objects comes with the package dygraphs. The function dygraphs::dygraph() produces an interactive visualization that helps localize specific events in time and compare several time series at once.

dygraphs::dygraph(
  data = cbind(tsl$Los_Angeles, tsl$Kings), 
  ylab = "Covid-19 Prevalence"
  )


Descriptive Stats

Time series have at least two dimensions of interest: time, and the variable/s of interest, such as prevalence in this case.

Having a good understanding of the time features of a time series is important, because some data management decisions depend on it. Details like length, resolution, time units, or regularity may define how we design an analysis. In distantia, the function tsl_time() provides all relevant details about the time in your time series list.

df_time <- distantia::tsl_time(
  tsl = tsl
)

The table below shows the first five rows of df_time, with the name of the time series, the class of the time index in each zoo object, the units, length and resolution, and their time ranges.

Here, all time series are regular and have the same time resolution and range, which makes the table above rightfully boring.

Once acquainted with time, having a summary of the values in each time series also helps make sense of the main properties of the data. The function tsl_stats() summarizes several important statistical properties of the time series in tsl.

df_stats <- distantia::tsl_stats(
  tsl = tsl,
  lags = 1:6 #weeks
)

The first part of the output corresponds with common statistical descriptors, such as centrality and dispersion metrics, and higher moments.

The second part of the output, produced by the lags argument, shows time series autocorrelation (Moran’s I) for different time lags. Since the resolution of all time series is 7 days per data-point, each time lag represents a week. A value of 0.9 for the lag 1 indicates the Pearson correlation between the prevalence of each data point and its precedent neighbor.

The stats produced by tsl_stats() can be joined to the spatial data frame covid_counties using their common column name to link cases.

covid_counties <- dplyr::inner_join(
  x = covid_counties,
  y = df_stats,
  by = "name"
)

This join facilitates mapping any of the descriptive stats. For example, the map below shows the maximum prevalence per county across the complete time period.

mapview::mapview(
  covid_counties, 
  zcol = "q3",
  layer.name = "Max prevalence",
  label = "name",
  col.regions = grDevices::hcl.colors(n = 5, palette = "Zissou 1")
  )

The functions tsl_subset() and tsl_stats() can be combined to generate stats focused in a given period of time or sub-group of time series. As example, the code below illustrates how to generate the stats for the year 2021.

df_stats_2021 <- distantia::tsl_stats(
  tsl = tsl_subset(
    tsl = tsl,
    time = c("2021-01-01", "2021-12-31")
  ),
  lags = 1:6
)

Dissimilarity Analysis

The package distantia is specifically designed to compare time series, and provides two general methods to do so: Lock-Step (LS) and Dynamic time warping (DTW). This article provides a detailed guide on how to apply these methods with distantia.

Dynamic Time Warping and Lock-Step Comparison

The LS method directly compares samples observed at the same time in time series of the same length. It is computationally efficient and easy to interpret but cannot account for differences in outbreak speed or timing, potentially missing underlying similarities when outbreaks are not synchronized.

DTW warps the time axis of the compared time series to maximize shape similarity. It adjusts for differences in the speed and timing of outbreaks, such as epidemic waves peaking at different times in different regions, and works well with time series of varying lengths or temporal resolutions. However, DTW is computationally expensive for large data sets, can be harder to interpret, and is sensitive to data scaling.

Before going forward, let me show a little example to help develop a sense of how DTW and LS work on a small subset of our time series. The figure below shows two years of data for San Francisco, Napa, and Solano.

At first glance, Napa and Solano seem quite well synchronized and look very similar. On the other hand, San Francisco, in spite of having the same number of events as the other two, shows a timing difference of several months.

The table below shows the LS comparison between these time series. The psi column indicates the dissimilarity between pairs of time series.

tsl_smol |> 
  distantia::distantia_ls() |> 
  dplyr::mutate(
    psi = round(psi, 3)
  ) |> 
  reactable::reactable(
    resizable = TRUE,
    striped = TRUE,
    compact = TRUE,
    wrap = FALSE,
    fullWidth = FALSE
  )

As expected, LS captures the synchronicity between Napa and Solano quite well, and returns a psi score indicating a high similarity. Meanwhile, San Francisco shows a high psi score when compared with the other two time series, indicating dissimilarity due to its asynchronous pattern. So far so good!

Now, let’s take a look at the DTW result:

tsl_smol |> 
  distantia::distantia_dtw() |> 
  dplyr::mutate(
    psi = round(psi, 3)
  ) |> 
  reactable::reactable(
    resizable = TRUE,
    striped = TRUE,
    compact = TRUE,
    wrap = FALSE,
    fullWidth = FALSE
  )

Surprising, right? When cancelling out the time shift between time series, it turns out that San Francisco is more similar to Solano than Napa!

This happens because DTW ignores the dates of the observations, and can link one sample of one time series with one or several samples of the other time series. The plot below, done with dtw::dtw() and dtw::dtwPlotTwoWay(), shows first sample of San Francisco (first case of the dashed red line), with prevalence 0, is linked to all samples with prevalence 0 at the beginning of the Napa series. Since the distance of all these links is 0, this operation cancels out the difference in timing between both time series.

These DTW results indicate that, for the given period of time, San Francisco, Napa, and Solano might be parts of the same system, in spite of the time-delay between them. This is a conclusion that cannot be achieved when applying the lock-step method!

It is for this reason that DTW is preferable when outbreak timing varies significantly and shape similarity is more important than exact time alignment. On the other hand, LS is best applied when the question at hand is focused on time-series synchronization rather than shape similarity.

##Computing Dissimilarity with distantia

Now that the features of DTW and LS are clear, we are going to compute dissimilarity scores with both methods for the complete dataset.

#lock-step
psi_ls <- distantia::distantia_ls(tsl = tsl)

#dynamic time warping
psi_dtw <- distantia::distantia_dtw(tsl = tsl)

The p-value represents the probability of finding a lower psi (higher similarity) when the time series are permuted. Low p-values indicate a strong similarity, better than expected by chance.